Experiment 1: Modelling with and without causality

The goal of this experiment is to check whether knowing the causal relations between the variables of our data set helps us build better models.

The experiment has the following steps:

  1. Define a causal graph (Directed Acyclic Graph, DAG) relating five (5) hypothetical variables. Some of them will be assumed to be unobserved.
  2. Define causal rules between those variables.
  3. Generate a data set following those rules.
  4. Train a general ML model on the data set in order to predict the effect of X on Y.
  5. Train a general ML model on the data set, using the causal information, in order to predict the effect of X on Y.
  6. Compare the performance of the two approaches:
    1. On unseen data that follows the distribution of the training set.
    2. On unseen data that does not follow the distribution of the training set.

1. Define a causal graph

Here I define a causal graph whose relations may cause problems if not treated adequately. Nodes in gray will not be visible in the data set.

In [1]:
import numpy as np
import pandas as pd
import pandas_profiling as pp
import plotly_express as px

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

import networkx as nx
from nxpd import draw

# np.random.seed(seed=42) # Test set is not representative
np.random.seed(seed=22)
In [2]:
G = nx.DiGraph()
G.graph['dpi'] = 120
G.add_nodes_from(['X', ('A',{'color':'gray'}), 'B', ('C',{'color':'gray'}), 'Y'])
G.add_edges_from([('A','X'),('A','B'),('C','B'),('C','Y'),('X','Y')])
draw(G, show='ipynb')
Out[2]:

The problem presented to the ML practitioner will be:

Can you estimate the real influence of X on Y?
That is, what happens to Y when I increase X by one unit, keeping everything else equal?

2-3. Define causal rules between variables and create data set

Let's define some rules to later generate a data set:

In [3]:
n = 10000
In [4]:
sigma_A = 2
mu_A = -2
A = sigma_A * np.random.randn(n,1) + mu_A

sigma_C = 8
mu_C = 3
C = sigma_C * np.random.randn(n,1) + mu_C
In [5]:
B = 5*A - 2*C + np.random.randn(n,1)/10
X = -3*A + np.random.randn(n,1)/10
Y = X + 2*C + np.random.randn(n,1)
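As a sanity check on what "the real influence of X" means, we can simulate the intervention directly from the structural equations: under do(X = x + 1) the arrow A → X is cut, C is untouched, and Y must rise by exactly 1 per unit of X. A minimal sketch (a fresh sample with the noise terms omitted for clarity, not the notebook's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = 2 * rng.standard_normal(10_000) - 2   # A ~ N(-2, 2^2)
C = 8 * rng.standard_normal(10_000) + 3   # C ~ N(3, 8^2)
X = -3 * A
Y = X + 2 * C

# Graph surgery: force X to X + 1 while leaving C (and hence 2*C) unchanged
Y_do = (X + 1) + 2 * C
print((Y_do - Y).mean())  # exactly 1.0
```

This is the ground truth the two models in steps 4 and 5 will be judged against.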
In [6]:
df_data = np.concatenate((A,B,C,X,Y), axis=1)
df = pd.DataFrame(data=df_data, columns=['A','B','C','X','Y'])
df.head()
Out[6]:
A B C X Y
0 -2.183900 -6.623908 -2.180563 6.541453 2.950146
1 -4.926701 -31.113046 3.263973 14.814058 19.266795
2 0.163583 -33.534884 17.229605 -0.460847 34.022308
3 -2.478650 -13.149468 0.397586 7.294810 9.476957
4 -2.982258 -0.705415 -7.144619 8.881095 -5.773796

Prepare training set

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='Y'), df[['Y']], test_size=0.3, random_state=42)
In [8]:
y_train = np.squeeze(y_train)
y_test  = np.squeeze(y_test)
In [9]:
df_real = df[['B','X','Y']]
X_train_real = X_train[['B','X']]
X_test_real = X_test[['B','X']]
df_real.head()
Out[9]:
B X Y
0 -6.623908 6.541453 2.950146
1 -31.113046 14.814058 19.266795
2 -33.534884 -0.460847 34.022308
3 -13.149468 7.294810 9.476957
4 -0.705415 8.881095 -5.773796

Explore data set

In [10]:
pp.ProfileReport(df, style={'full_width':True})
Out[10]:

In [11]:
pp.ProfileReport(df_real, style={'full_width':True})
Out[11]:

In [12]:
px.scatter(data_frame=df, x='X', y='Y')
In [13]:
px.scatter(data_frame=df, x='B', y='Y')

4. Train a general ML model on the data set

Let's try a Linear Regression. Both X and, especially, B are highly correlated with Y, so I will use both variables:

In [14]:
linear_model = LinearRegression()
linear_model.fit(X_train_real, y_train)
predictions_linear_model = linear_model.predict(X_test_real)
print("MAE in test set = {}".format(np.round(mean_absolute_error(predictions_linear_model, y_test), decimals=3)))
MAE in test set = 0.813

It looks like the model has learned the patterns in the data set.

In [15]:
X_train_real.columns
Out[15]:
Index(['B', 'X'], dtype='object')
In [16]:
linear_model.coef_
Out[16]:
array([-1.00070872, -0.6663677 ])

The model says that for each extra unit of X, Y is expected to decrease by 0.67.

But remember the generating rule:

Y = X + 2*C (plus a small noise term)

So the influence of X on Y should be around 1, not -0.67 (!)
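The negative coefficient is not a fluke; it follows from the generating equations. Ignoring the noise terms, X = -3A gives A = -X/3, and B = 5A - 2C gives C = (5A - B)/2 = (-5X/3 - B)/2, so Y = X + 2C = -(2/3)X - B. A quick numerical check (a sketch on a fresh noiseless sample, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = 2 * rng.standard_normal(10_000) - 2
C = 8 * rng.standard_normal(10_000) + 3
B = 5 * A - 2 * C          # noiseless versions of the generating rules
X = -3 * A
Y = X + 2 * C

# Least-squares fit of Y on [B, X, 1] recovers the algebraic identity
design = np.column_stack([B, X, np.ones_like(X)])
coefs, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(np.round(coefs, 3))  # ≈ [-1, -2/3, 0], matching the fitted model
```

So the fitted values of roughly -1.00 for B and -0.67 for X are exactly what regression should find in this data; the problem is that -0.67 is an associational coefficient, not the causal effect.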

5. Train a general ML model using causal information

Looking at the causal graph, it is clear that, in order to isolate the influence of X, we need to ignore B: B is a collider on the back-door path X ← A → B ← C → Y, and conditioning on it opens that path, creating a confounding situation.
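This back-door reasoning can also be checked mechanically with NetworkX's d-separation test (the function is `d_separated` in older versions and `is_d_separator` from 3.3 on, hence the lookup below): remove the direct edge X → Y so only back-door paths remain, then ask whether X and Y are d-separated given different conditioning sets.

```python
import networkx as nx

G = nx.DiGraph([('A', 'X'), ('A', 'B'), ('C', 'B'), ('C', 'Y'), ('X', 'Y')])

# Drop the direct causal edge so only back-door paths remain
G_bd = G.copy()
G_bd.remove_edge('X', 'Y')

# NetworkX renamed d_separated to is_d_separator around version 3.3
d_sep = getattr(nx, 'is_d_separator', getattr(nx, 'd_separated', None))

# B is a collider on X <- A -> B <- C -> Y: the path is blocked by default...
print(d_sep(G_bd, {'X'}, {'Y'}, set()))   # True
# ...but conditioning on B opens it, confounding the estimate of X -> Y
print(d_sep(G_bd, {'X'}, {'Y'}, {'B'}))   # False
```

With an empty conditioning set the back-door path is already blocked, which is exactly why the model trained on X alone recovers the right coefficient.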

So let's try the same model, but training only on X:

In [17]:
X_train_causal = X_train[['X']]
X_test_causal = X_test[['X']]
In [18]:
causal_linear_model = LinearRegression()
causal_linear_model.fit(X_train_causal, y_train)
predictions_causal_linear_model = causal_linear_model.predict(X_test_causal)
print("MAE in Test set = {}".format(np.round(mean_absolute_error(predictions_causal_linear_model, y_test), decimals=2)))
MAE in Test set = 13.06

The model is considerably worse at predicting Y (as expected; it only considers the influence of X).
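The size of this error is predictable from the generating rules: the causal model ignores C, so its residual is roughly 2*(C - mu_C) plus the unit noise, and for a centered Gaussian E|Z| = sigma * sqrt(2/pi). A quick back-of-the-envelope sketch:

```python
import numpy as np

# Residual of the X-only model: ~ 2*(C - mu_C) + unit noise,
# with standard deviation sqrt((2*8)**2 + 1**2) ≈ 16.03
sigma_resid = np.sqrt((2 * 8) ** 2 + 1 ** 2)

# Mean absolute value of a centered Gaussian: sigma * sqrt(2 / pi)
expected_mae = sigma_resid * np.sqrt(2 / np.pi)
print(round(expected_mae, 2))  # ≈ 12.79, in line with the observed 13.06
```

So an MAE around 13 is exactly what we should expect once C is left out of the model.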

In [19]:
causal_linear_model.coef_
Out[19]:
array([1.04990426])

This time, the model predicts that for each unit increment in X, Y is expected to increase by 1.05, which is a much better approximation to the real effect value (1).

6A. Performance comparison with unseen data similar to the training set

Let's take a reference value:

In [20]:
X_test.head(1)
Out[20]:
A B C X
6252 -2.214132 -21.307698 5.169252 6.707582

What would happen if I increase X by five units?

In [21]:
do_X_test = X_test.copy()
do_X_test[['X']] = do_X_test[['X']] + 5.
In [22]:
do_X_test.head(1)
Out[22]:
A B C X
6252 -2.214132 -21.307698 5.169252 11.707582

ML answer:

In [23]:
predictions_ml = linear_model.predict(do_X_test[['B','X']])

Causal inference answer (add 5 times the estimated causal coefficient, 1.05, to the observed Y):

In [24]:
predictions_causal_inference = y_test + 5 * 1.05

Real answer:

In [25]:
real_answer = do_X_test.X + 2.0 * do_X_test.C
In [26]:
print("MAE ML: {}".format(np.round(mean_absolute_error(real_answer, predictions_ml), decimals=2)))
print("MAE Causal Inference: {}".format(np.round(mean_absolute_error(real_answer, predictions_causal_inference), decimals=2)))
MAE ML: 8.33
MAE Causal Inference: 0.82

The causal inference method performs much better!
This is not surprising after seeing the coefficient estimates produced by each method.

Conclusion

This is a very simple (silly?) experiment, but it shows some important truths:

  1. Machine learning models (statistics in general) are very good at predicting when the unseen data is similar to the training data, but they don't necessarily capture the real underlying mechanisms.

  2. If we want to predict what will happen when we intervene on a variable, plain ML will not help, because the intervention artificially changes the data distribution.

  3. Causal inference is like an extra dimension that lets us see things we couldn't see before.